Send commit concurrently in client side #59
Conversation
Codecov Report

```diff
@@            Coverage Diff             @@
##           master      #59      +/-   ##
============================================
- Coverage   55.21%   55.16%    -0.05%
+ Complexity   1111     1110        -1
============================================
  Files         148      148
  Lines        7953     7962        +9
  Branches      760      760
============================================
+ Hits         4391     4392        +1
- Misses       3321     3328        +7
- Partials      241      242        +1
```

Continue to review full report at Codecov.
Do you have performance tests? I guess this PR can't improve the performance, because the performance bottleneck of …
Yes, I tested it. When using localfile mode, it cost 7.3 min. @jerqi
As far as I know, the spill-to-disk event needs to be triggered by the client side, so if the previous trigger is blocked, the next one will …
We don't recommend using the storageType …
```java
}

/**
 * This method will wait until all shuffle data have been spilled
```
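The Javadoc fragment above describes a wait-for-spill barrier. As a purely illustrative sketch (the class and method names here are hypothetical, not the PR's actual code), such a barrier can be expressed with a `CountDownLatch` that writers count down as each spill finishes:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class SpillBarrierExample {
    // Illustrative only: one latch count per in-flight spill. Writers call
    // countDown() when their spill completes; the committer blocks here until
    // all shuffle data have been spilled, or the timeout expires.
    public static boolean waitForSpills(CountDownLatch inFlightSpills, long timeoutMs)
            throws InterruptedException {
        return inFlightSpills.await(timeoutMs, TimeUnit.MILLISECONDS);
    }
}
```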
Please put performance test results into …
```java
package org.apache.uniffle.client.util;
```

```java
import org.apache.hadoop.io.OutputBuffer;
```
```diff
  public ShuffleWriteClientImpl(String clientType, int retryMax, long retryIntervalMax, int heartBeatThreadNum,
      int replica, int replicaWrite, int replicaRead, boolean replicaSkipEnabled,
-     int dataTranferPoolSize) {
+     int dataTranferPoolSize, int commitSenderPoolSize) {
```
We prefer the code style below:

```java
public ShuffleWriteClientImpl(
    String clientType,
    int retryMax,
    long retryIntervalMax,
    int heartBeatThreadNum,
    int replica,
    int replicaWrite,
    int replicaRead,
    boolean replicaSkipEnabled,
    int dataTranferPoolSize,
    int commitSenderPoolSize) {
```
```java
public static final String RSS_COMMIT_SENDER_POOL_SIZE =
    MR_RSS_CONFIG_PREFIX + RssClientConfig.RSS_COMMIT_SENDER_POOL_SIZE;
public static final int RSS_COMMIT_SENDER_POOL_SIZE_DEFAULT_VALUE =
    RssClientConfig.RSS_COMMIT_SENDER_POOL_SIZE_DEFAULT_VALUE;
```
The name's style should be consistent with `data_transfer_pool_size`. How about `data_commit_pool_size`?
Could you update the documentation about this feature?
4b5389f
If we close the ForkJoinPool within the scope of the method, I think it's OK.
Ok |
The performance of …
client-mr/src/main/java/org/apache/hadoop/mapreduce/RssMRConfig.java
Besides, I think I can submit a new PR to let …
client-mr/src/main/java/org/apache/hadoop/mapreduce/RssMRUtils.java
client-spark/spark2/src/main/java/org/apache/spark/shuffle/RssShuffleManager.java
client/src/main/java/org/apache/uniffle/client/factory/ShuffleClientFactory.java
client/src/main/java/org/apache/uniffle/client/impl/ShuffleWriteClientImpl.java
We'd better have performance tests.
```java
    });
  });
}).join();
} catch (Exception e) {
```
Should we use a `finally` block here?

```java
finally {
  forkJoinPool.shutdownNow();
}
```
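To illustrate the suggestion: a minimal, self-contained sketch of the pattern under discussion, where the method-scoped `ForkJoinPool` is always released in `finally` even if a task throws (the helper name and task list are hypothetical, not the PR's actual code):

```java
import java.util.List;
import java.util.concurrent.ForkJoinPool;

public class CommitPoolExample {
    // Hypothetical helper: runs the given commit tasks on a method-scoped
    // ForkJoinPool. shutdownNow() in finally ensures the pool's threads are
    // not leaked when a task fails or join() throws.
    public static void runCommits(List<Runnable> commitTasks, int poolSize) {
        ForkJoinPool forkJoinPool = new ForkJoinPool(poolSize);
        try {
            forkJoinPool.submit(
                () -> commitTasks.parallelStream().forEach(Runnable::run)
            ).join();
        } finally {
            forkJoinPool.shutdownNow();
        }
    }
}
```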
Could you update the documentation, since this PR introduces a user-facing change?
Done. @jerqi
What changes were proposed in this pull request?
Send commit concurrently on the client side.
Why are the changes needed?
I found that when using the `LOCALFILE` storageType, waiting for the commit costs too much time. To speed it up, the commit can be sent concurrently using a thread pool.

Performance Test Case
Using 1000 Spark executors (1g/1core each) to run TeraSort on 1TB.
When using `LOCALFILE` storageType mode, it cost 7.3 min. After applying this PR, it cost 6.1 min.
Does this PR introduce any user-facing change?
A new config `rss.client.data.commit.pool.size` is introduced; its default value is the number of assigned shuffle servers.

How was this patch tested?
No need
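To make the idea concrete, here is a rough sketch of sending the commit to each assigned shuffle server in parallel instead of sequentially. This is illustrative only: the `ShuffleServer` interface and method names are stand-ins, not the actual Uniffle API, and `poolSize` plays the role of `rss.client.data.commit.pool.size` (which this PR defaults to the number of assigned servers):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentCommitSketch {
    // Stand-in for a shuffle server handle (hypothetical, not the real API).
    public interface ShuffleServer {
        boolean sendCommit(int shuffleId);
    }

    // Sends the commit to all servers concurrently on a fixed-size pool.
    // Returns true only if every server acknowledges the commit.
    public static boolean sendCommitConcurrently(
            List<ShuffleServer> servers, int shuffleId, int poolSize) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<Boolean>> futures = new ArrayList<>();
            for (ShuffleServer server : servers) {
                futures.add(pool.submit(() -> server.sendCommit(shuffleId)));
            }
            boolean allOk = true;
            for (Future<Boolean> future : futures) {
                try {
                    allOk &= future.get();
                } catch (ExecutionException e) {
                    allOk = false; // one failed commit fails the whole commit
                }
            }
            return allOk;
        } finally {
            pool.shutdown();
        }
    }
}
```

Since each commit request is independent network I/O, issuing them from `poolSize` threads lets the total wait approach the slowest single server rather than the sum of all servers, which matches the 7.3 min to 6.1 min improvement reported above.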